Electronic excited states in deep variational Monte Carlo

Obtaining accurate ground and low-lying excited states of electronic systems is crucial in a multitude of important applications. One ab initio method for solving the Schrödinger equation that scales favorably for large systems is variational quantum Monte Carlo (QMC). The recently introduced deep QMC approach uses ansatzes represented by deep neural networks and generates nearly exact ground-state solutions for molecules containing up to a few dozen electrons, with the potential to scale to much larger systems where other highly accurate methods are not feasible. In this paper, we extend one such ansatz (PauliNet) to compute electronic excited states. We demonstrate our method on various small atoms and molecules and consistently achieve high accuracy for low-lying states. To highlight the method’s potential, we compute the first excited state of the much larger benzene molecule, as well as the conical intersection of ethylene, with PauliNet matching results of more expensive high-level methods.

Obtaining accurate ground and low-lying excited states of electronic systems is crucial in a multitude of important applications. One ab initio method for solving the Schrödinger equation that scales favorably for large systems is variational quantum Monte Carlo (QMC). The recently introduced deep QMC approach uses ansatzes represented by deep neural networks and generates nearly exact ground-state solutions for molecules containing up to a few dozen electrons, with the potential to scale to much larger systems where other highly accurate methods are not feasible. In this paper, we extend one such ansatz (PauliNet) to compute electronic excited states. We demonstrate our method on various small atoms and molecules and consistently achieve high accuracy for low-lying states. To highlight the method's potential, we compute the first excited state of the much larger benzene molecule, as well as the conical intersection of ethylene, with PauliNet matching results of more expensive high-level methods.
The fundamental challenge of quantum chemistry, solid-state physics, and many areas of computational materials science is to obtain solutions to the electronic Schrödinger equation for a given system, which in principle provides complete access to its chemical properties. The ground and low-lying excited states typically determine the behavior of a system and are therefore of the most interest in many applications. Understanding and being able to describe excited-state processes 1 , including a wide variety of important spectroscopy methods such as fluorescence, photoionization, and optical absorption of molecules and solids, is key to the successful design of new materials.
Unfortunately, the Schrödinger equation cannot be solved exactly except in the simplest cases, such as one-dimensional toy systems or a single hydrogen atom. Accordingly, many approximate numerical methods have been developed which provide solutions at varying degrees of accuracy. Time-dependent density functional theory 2,3 (TDDFT) is the most popular method due to its computational efficiency but has well-known limitations [4][5][6][7][8][9] . Higher-accuracy methods have a computational cost that scales rapidly with system size-the well-established full configuration interaction 10 (FCI) and coupled cluster 11 (CC) techniques scale ∼ Oð exp ðNÞÞ (FCI scales exponentially, while truncated CI scales polynomially.) and ∼ OðN 5À10 Þ (The scaling of CC depends on the particular method used: CC2 OðN 5 Þ, CCSD OðN 6 Þ, CCSD(T) OðN 7 Þ, CC3 OðN 7 Þ, CCSDT OðN 8 Þ, CCSDT(Q) OðN 9 Þ, CCSDTQ OðN 10 Þ.) respectively, where N is the number of electrons, thereby severely limiting their practical use. There is thus a huge need for ab initio methods that scale more favorably with system size, allowing the modeling of practically relevant molecules and materials.
Quantum Monte Carlo (QMC) techniques offer a route forward with their favorable scaling (OðN 3À4 Þ) and therefore dominate highaccuracy calculations where other methods are too expensive 12,13 . A state-of-the-art QMC calculation typically involves the construction of a multi-determinant baseline wavefunction through standard electronic-structure methods, which is augmented with a Jastrow factor to efficiently incorporate electron correlation, and then optimized through variational QMC (VMC) to obtain a trial wavefunction. This is then used within fixed-node diffusion QMC (DMC) to obtain a final electronic energy. The fixed-node approximation is used to avoid exponential scaling, with the drawback that the nodal surface of the trial wavefunction cannot be modified, which limits the accuracy of the DMC result 14 . A more expressive baseline wavefunction can improve upon this but traditional DMC often needs thousands to hundreds of thousands of determinants to reach convergence 15 . Additionally, DMC only provides the final energy, restricting the calculation of other electronic properties 16 . Both of these limitations can, in principle, be resolved at the VMC level, with the accuracy of VMC constrained only by the flexibility of the trainable wavefunction ansatz. So far, these techniques have mostly been developed for ground-state calculations, with different extensions proposed to address excited states 12,[17][18][19][20][21][22][23][24][25][26] .
Recently, the new ab initio approach of deep VMC methods has been introduced [27][28][29][30] and subsequently further extended and improved [31][32][33] . In particular, PauliNet 27 and FermiNet 28 were the first methods to demonstrate that highly accurate ground-state results for molecules could be obtained using deep VMC with lower computational complexity and using orders of magnitude fewer Slater determinants than typically employed in other methods that achieve similar accuracy.
In the same spirit as Carleo and Troyer proposed for optimizing quantum states in lattice models 34 , VMC is used in order to train a neural network model that represents the many-body wavefunction in an unsupervised fashion, i.e. in contrast to other quantum machine learning approaches the only input to the method is the Hamiltonian, and training data are generated on the fly by sampling from the current wavefunction model and minimizing the variational energy. In both PauliNet and FermiNet deep antisymmetric neural networks are used to represent the fermionic wavefunction in the real space of electron coordinates.
Recently, there has been much interest in developing deep learning methods for excited states 35 . In this paper, we extend PauliNet towards the ab initio computation of electronic excited states (see the "Methods" section for details). The input is again only the Hamiltonian of the quantum system. By employing a simple energy minimization and numerical orthogonalization procedure, we are able to obtain the lowest excited-state wavefunctions of a given system. The excitedstate optimization makes use of a penalty method that minimizes the overlap between the nth excited state and the lower-lying states in the spectrum. Optimization methods that introduce additional constraints have been used in the context of VMC before 26 and provide a simple way to obtain orthogonal states without explicit enforcement in the wavefunction ansatzes. Combining these techniques with the expressiveness of neural network ansatzes yields highly accurate approximations to excited states with direct access to the wavefunctions for the evaluation of electronic observables. Neural networkbased methods have targeted low-lying excited states of onedimensional lattice models 25 , but have not been applied to firstprinciples systems.
We demonstrate our method on a variety of small-and mediumsized molecules, where we consistently achieve highly accurate total energies, outperforming traditional quantum chemistry methods. We also compute excitation energies, transition dipole moments, and oscillator strengths, the main ground-to-excited transition properties, with the latter two known to be more sensitive to errors in the underlying wavefunctions than energies. In all test systems, we find PauliNet closely matches high-order CC and experimental results. Next, we show that our method can be applied in a straightforward manner to much larger molecules, using the example of benzene where we match significantly more expensive high-level electronicstructure methods. Finally, we demonstrate that PauliNet can be used to compute excited-state potential energy surfaces by modeling an avoided crossing and conical intersection of ethylene, a highly multireferential problem.

Nearly exact solutions for small atoms and molecules
To demonstrate our method we start by applying it to a range of small atoms and molecules. We optimize the lowest-lying excited states and compute their vertical excitation energies for the ground-state equilibrium geometry (see Supplementary Table I), with each PauliNet wavefunction containing a maximum of 10 determinants. In all systems, we obtain highly accurate total energies and estimates of the first few excitation energies competitive with high-accuracy quantum chemistry methods.
In Fig. 1 the excitation energies of the lowest states are shown for several atoms. For all the atoms the excitation energies are obtained within 4 mHa of the theoretical best estimates (TBE) 36 . Due to the high degree of symmetry the atoms exhibit degeneracies, that is, multiple orthogonal states can be found with the same energy. Being subject to the orthogonalization constraint, PauliNet approximates all orthogonal states of an energy level individually, which is observed by attaining multiple results at the same energy level. The multiplicity of the exact solution can be obtained theoretically by considering the electronic configurations of the atoms and is reproduced within our experiments.
We then compute a larger number of excited states for LiH, BeH and Be. In each experiment, we optimize eight ansatzes in parallel. In Fig. 2 we illustrate the training process by plotting the convergence of the total energies and excitation energies. Additionally, we plot the training estimates of the pairwise overlaps of the wavefunctions, which remain small throughout the optimization process. We confirm that the final overlaps are near-zero by exhaustively sampling the trained wavefunctions, thereby obtaining well-converged Monte Carlo estimates (see Supplementary Table VI). Based on the degeneracies we find a total of five (LiH), four (BeH), and three (Be) distinct excitation energies, respectively. The excitation energies match those from reference values, and in particular, we find that for all systems studied here we reliably obtain the first excited state, and apart from one case also the second excited state. However, especially for clusters of higher-lying excited states with similar energies, we typically do not find all members of the cluster. In these cases, which states are found depends on the initialization of our ansatzes, as well as the total number of states that are being sought. To give a transparent picture of the capabilities of our method, in this work we have refrained from optimizing the CASSCF baseline in order to find all possible excitations. PauliNet results for the excitation energies (with (red) and without (yellow) variance matching (see the "Methods" section for details)) are compared to the theoretical best estimates (TBE) taken from the NIST database 36 . Multiple PauliNet ansatzes with identical energies correspond to orthogonal degenerate states. For the TBE we have depicted four excitations per atom, taking account of the degeneracies. For all atoms, we find the first excited state with high accuracy. For B, C, and O the ground state is threefold degenerate. For these systems we choose one of the three states to compute excitation energies, resulting in transitions with a relative energy of zero. For Li and Be a further excitation energy is found. While we obtain the second excited state for Be, in Li we miss out intermediate states and instead find the transition from the ground state to the 2 D state. This can be related to the generic CASSCF initialization of the ansatzes (see the "Methods" section for details). (The numerical data can be found in Supplementary Table II; source data are provided as a Source Data file).

Highly accurate wavefunctions: transition dipole moments and oscillator strengths
Total energies and vertical excitation energies are the primary focus when benchmarking excited-state methods as they are readily available from many theoretical models and provide a good initial guess of a particular method's accuracy. However, they provide only a partial characterization of the electronic states, and while a method in question may give accurate energies, other quantities of key importance may be inaccurate [37][38][39] .
Transition dipole moments (TDM) and oscillator strengths are two principal ground-to-excited transition properties and are of great interest. TDMs determine how polarized electromagnetic radiation will interact with a system due to its distribution of charge, and therefore determine transition rates and probabilities of induced state changes. In the electric dipole approximation, the TDM between two states i and j is given by whereμ = P k qr k is the sum over the position operator of each particle weighted by its charge, with q = −e for electronic systems. We obtain the expectation value by Monte Carlo sampling according to Eq. (15). While the TDM is important for understanding a number of processes, including optical spectra, it is generally a complex-valued vector quantity and not an experimental observable by itself. The closely related oscillator strength is what is inferred through the experiment and is given by where ΔE is the excitation energy between states i and j, and d 2 ij is the dipole strength. It is known that, in addition to being more basis-set sensitive, d ij and f ij are both highly dependent on the quality of the trial wavefunctions 40 and represent a more rigorous test for ab initio methods than just energies.
Recently, transition energies and oscillator strengths for a variety of small molecules have been computed using high-order CC calculations, systematically extrapolating to the complete basis set (CBS) limit, and comparing to experimental results where possible, in order to supply a comprehensive set of theoretical benchmarks 41,42 . In that spirit, we now use these results to benchmark the accuracy of oscillator strengths computed using PauliNet. Furthermore, we also compare multi-reference CC (MR-CC) results where possible 43 . We compute the first few electronic states for five molecules (BH, CH + , H 2 O, NH 3 , CO), such that we obtain the first non-zero oscillator strength (within the dipole approximation) for each. All calculations (CH + was not included in the CC calculations in refs. 41,42 . We instead compare to (MR-)CC results in ref. 43 , using the same ground-state equilibrium geometry, which was obtained in a split-valence basis augmented with diffuse and polarization functions. See refs. 43,44 for more details.) are performed at the same ground-state equilibrium geometries as refs. 41,42 (see Supplementary Table I) and using the same number of determinants (≤10) as in the section "Nearly exact solutions for small atoms and molecules".
Our results for all systems are shown in Fig. 3. First, we compute the amount of correlation energy recovered in the ground state and find PauliNet matches high-order CC methods (panel a). Second, we compute the excitation energy for each transition and find this to be close to the TBE, on par with CC and much more consistent than TDDFT where the accuracy depends on the molecule and on the exact TDDFT method used (panel b). Finally, we compare the oscillator strengths (for the 0 → 2 transition) in panel c. Even high-order methods such as CC and MR-CC can produce a spectrum of results depending on the expansion and basis set used, with this exacerbated in cheaper methods such as TDDFT (see the example of CO). In all systems, Pau-liNet compares well with experimental results, demonstrating the quality of deep VMC wavefunctions with just a minimal number of determinants.

Application to larger molecules
The previous two sections showed that we achieve highly accurate results across a range of small systems. While this is encouraging, traditional high-accuracy methods that are better established are readily available for such small systems. In this section, to demonstrate the potential of excited PauliNet, we show that it can be applied in a straightforward manner to significantly larger molecules. For this objective, we choose the example of the benzene molecule (panel a of Fig. 4). Studies of its electronic structure and other properties are plentiful due to its importance in bio and organic chemistry, and with 42 electrons all-electron calculations will be extremely demanding or even intractable for a high-level from FCI calculations and other highly accurate references 36,[57][58][59] . Due to the initialization from the CASSCF baseline, the wavefunctions start with a small overlap, which is retained throughout the optimization. (The numerical data can be found in Supplementary Table II; source data are provided as a Source Data file).
description of its electronic states, depending on the theory level used.
Using a PauliNet ansatz with just 10 determinants, the same as in the much smaller systems, and slightly deeper neural networks (see Supplementary Table VII) we obtain very good total energies for the ground state and first excited state (upper left of Fig. 4). We note the better accuracy than high-level CC calculations, with this signifying highly accurate wavefunctions that can be used to compute other observables, as demonstrated in the previous section. The computed excitation energy is also shown (right of Fig. 4), with PauliNet compared against several experimental and theoretical results. The lower experimental result 45 (dashed black line) quantifies an adiabatic excitation energy, i.e. the energy difference between the ground state and the excited state at the corresponding relaxed geometries. This quantity is corrected to obtain the vertical excitation energy 26 (solid black line), which omits nuclear relaxation and vibrational effects. As our calculations are performed at the ground-state equilibrium geometry, we are targeting the vertical excitation energy, and therefore consider this corrected experimental result to be closer to the ground truth. We find this to be slightly underestimated by high-order methods (CC, DMC), and slightly overestimated by PauliNet. In other systems (panel b of Fig. 3) we notice a similar trend when comparing to the TBE.
PauliNet formally scales as OðN 4 Þ with the number of electrons N, and in practice, we observe a scaling behavior OðN 3 Þ for the systems investigated so far, which is related to quadratic scaling of the neural network with an extra factor from the evaluation of the local energy. As PauliNet is currently implemented in a research code, which is not optimized for production purposes, the computational time will have a large prefactor which makes it computationally unfavorable to, e.g. CC methods for small molecules. However, its very favorable scaling in N compared to OðN 5À10 Þ of high-level electronic-structure methods dominates for larger molecules, and this is clearly visible in benzene. For instance, ref. 46 used several state-of-the-art methods to obtain accurate benzene ground-state energies, with calculations run on several CPU types in a highly parallel manner (see Supporting Information of ref. 46 for details). PauliNet was run on a single RTX 3090 GPU at a fraction of the number of node hours. Although Pauli-Net is the computationally cheapest method in this comparison, it provides a significantly better (variational) ground-state energy than all methods (~0.48 Ha lower). As all methods compared in Fig. 4 provide similar excitation energies, these cannot be used to group the methods into more or less accurate, but overall this data indicates that PauliNet and deep VMC methods in general have a very favorable cost/ accuracy trade-off for molecules of the size of benzene and beyond.

Multi-reference application: conical intersections
Molecular configurations that produce electronic states with similar energies are fundamental in photochemical applications. Such configurations can lead to several states mixing, meaning they are all necessary for an accurate description of a particular process. Conical intersections are produced when two states become degenerate and require the computation of excited-state potential energy surfaces.  The modeling of energy surfaces near degeneracies is inherently multireference with significant electronic correlation and is thus a challenging application for electronic-structure methods.
As a final application of excited PauliNet, we compute ground-and excited-state potential energies for ethylene (H 2 C=CH 2 ) as a function of its torsion and pyramidalization angles (see inset of Fig. 5). Twisting around the C=C bond raises the energy of the ground state while lowering that of the first-excited singlet state, giving rise to an avoided crossing at a torsion angle τ of 90°. From this twisted structure, the energy gap between the two states is further reduced through the pyramidalization of one of the CH 2 groups, leading to a conical intersection. These potential energy curves, whose modeling is often too challenging for single-reference methods [47][48][49] , have been characterized using multi-reference configuration interaction (MR-CI) methods 50 which we use for comparison.
We choose the same ground-state (planar) geometry as ref. 50 (optimized using a small CAS and the aug-cc-pVDZ basis set; see Supplementary Table I) and find the excitation energy between the ground state and first-excited singlet state to be within a few mHa of the MR-CI results. As we vary τ, while keeping all other geometric parameters fixed, we find the energy curves to be well reproduced by PauliNet, with an avoided crossing at τ = 90°(panel a of Fig. 5; curves symmetric about τ = 90°). Single-reference methods, such as TDDFT (see figure), often overestimate the energy of the ground state at τ = 90°(barrier) and produce an unphysical cusp.
Next, we take the same twisted structure (τ = 90°) as ref. 50 (optimized using a small CAS and the aug-cc-pVDZ basis set; see Supplementary Table I) and vary the pyramidalization angle ϕ, while keeping all other geometric parameters fixed. While there is a small discrepancy between PauliNet and the MR-CI results (panel b of Fig. 5), the trend of the energy curves is well described, including the correct minimum of the excited-state curve (~70°) and the conical intersection (PauliNet: ϕ~100°; MR-CI: ϕ~96°). We note that many singlereference methods are unable to even qualitatively describe the conical intersection, instead predicting spurious features 49 .

Discussion
We have introduced an approach to compute highly accurate excitedstate solutions of the electronic Schrödinger equation for molecules by using deep neural networks that are trained in an unsupervised manner with variational Monte Carlo. We have employed the PauliNet architecture 27 to approximate the ground-and excited-state wavefunctions, however other architectures such as FermiNet 28 or second quantization approaches 29 could also be employed, with suitable modifications. As our approach to find excited states only constrains the excited-state wavefunctions, the ability to compute highly accurate and variational absolute ground-state energies is unchanged. In addition, we demonstrate for a number of small molecules containing up to 42 electrons, that excited PauliNet can reliably find the first excitation energies with an accuracy that is on par with high-level electronicstructure methods, whereas cheaper methods such as TDDFT are less consistent in approximating these energies. The accuracy of the excited-state wavefunctions is underlined by an accurate match of oscillator strengths, which depend on the transition dipole moment, a quantity that is more sensitive to the exact form of the wavefunction than the energy. For benzene (42 electrons), PauliNet already requires significantly less computational time than higher-order methods, and this advantage will only improve for larger molecules. Formally, a single PauliNet is an OðN 4 Þ method for N electrons, due to the computational cost of the Hartree-Fock or CASSCF baseline, however, in practice we empirically observe an OðN 3 Þ dependency for the system sizes tested, as discussed above. In addition, for excited-state calculations n PauliNet replicas are used which gives rise to OðnN 3 Þ + Oðn 2 N 2 Þ, with the latter term arising from the pairwise overlaps and having a much smaller prefactor than the former.
Notably, almost identical excited PauliNet architectures are used across the systems shown in this paper-up to minor modifications such as the budget of Slater determinants and the total number of excited states requested, and a deeper network for benzene to adapt for a potentially more complex wavefunction. Whereas a skilled quantum chemist can usually tune and specialize an existing electronic-structure method to give very high-accuracy results for a given molecule, our aim is the exact opposite: to provide a method that, by leveraging machine-learning tools, is as automated as possible and will work over a wide range of Hamiltonians provided.
We have demonstrated that we can compute ground-and excitedstate potential energy surfaces with the example of ethylene where we model an avoided crossing and conical intersection. Here, where single-reference methods often fail, PauliNet performs well against multi-reference CI results. By combining the present approach with recent and ongoing extensions of PauliNet 32 and FermiNet 33 that variationally compute entire potential energy surfaces, both highly accurate ground-and excited-state energy surfaces are now accessible with deep VMC methods. Future work will investigate the application of PauliNet to other interesting processes where molecular dynamics interacts with excited states.
One of the limitations of the current approach is that it appears difficult to reliably find all excited states up to a given desired number, especially in cases where several excited states have similar energies. This is a complex problem that depends on the Hartree-Fock/CASSCF initialization, on the total number of states requested, on the learning algorithm, and the expressiveness of the architecture and will be studied in more detail elsewhere. However, the first excited state could be reliably found for all molecules studied here, and apart from one exception also the second excited state. This, in combination with the a Total energies (relative to the ground state of the planar geometry E 0 ) of the ground state and first-excited singlet state as a function of torsion angle τ, with MR-CI 50 and TDDFT 48 results also plotted for comparison. TDDFT overestimates the barrier (the ground state at τ = 90°) and produces an unphysical cusp, while the MR-CI results which predict an avoided crossing are well reproduced by PauliNet. b Same as above but as a function of pyramidalization angle ϕ (τ = 90°), with the degeneracy of the two states producing a conical intersection. The arrows denote the conical intersection, with PauliNet (ϕ~100°) closely matching the MR-CI result (ϕ~96°). Note: The geometric parameters (bond lengths and angles) vary slightly between the torsion and pyramidalization experiments (see ref. 50 ). (The numerical data can be found in Supplementary Table V; source data are provided as a Source Data file).
high numerical accuracy and the favorable computational cost, makes deep VMC a promising method to compute both ground-and excitedstate properties for small-and medium-sized molecules with dozens or even low hundreds of electrons.

PauliNet ansatz
At the heart of our approach is the PauliNet ansatz, introduced in ref. 27 and further refined in ref. 51 , a multi-determinant Slater-Jastrow-backflow type trial wavefunction that is parametrized by highly expressive deep neural networks: where r = (r 1 , . . ., r N ) is the 3N-dimensional real space of electron coordinates. The structure of our ansatz ensures that the correct physics is encoded: the wavefunction obeys exact asymptotic behavior through the fixed electronic cusps γ, and is antisymmetric with respect to the exchange of like-spin electrons through the use of generalized Slater determinants, guaranteeing the Pauli exclusion principle is obeyed.
The expressiveness of PauliNet is contained in the Jastrow factor J θ and backflow f θ , which introduce many-body correlation, and are both represented through deep neural networks (denoted by trainable parameters θ). J θ and f θ are constructed in ways that preserve the antisymmetry of the fermionic wavefunction with respect to exchanging like-spin electrons, as well as its cusp behavior. The Jastrow factor is an exchange-symmetric function, and captures complex correlation effects through augmenting the Slater-determinant baseline, but is incapable of modifying the nodal surface of the determinant expansion. Changes to the nodal surface are possible through the backflow, which acts on the single-electron orbitals φ μ directly, transforming them into permutation-equivariant many-electron orbitalsφ μ . f θ is split into multiplicative (m) and additive (a) components (Eq. (4)), and is designed to be equivariant under the exchange of like-spin electrons.

Ground-state optimization
Like traditional VMC methods, PauliNet is based on the variational principle, which guarantees that the energy expectation value of a trial wavefunction ψ θ is an upper bound to the true ground-state energy: For a given system, a standard quantum chemistry method (Hartree-Fock (HF) for a single determinant; complete active space self-consistent field (CASSCF) for multiple determinants) is performed, with the solution supplemented by the analytically-known cusp conditions, thus producing a reasonable baseline wavefunction. We then optimize the PauliNet ansatz by minimizing the total electronic energy (serving directly as the loss), following the standard VMC trick of evaluating it as an expectation value of the local energy, E loc ðrÞ =ĤψðrÞ=ψðrÞ, over the probability distribution |ψ θ | 2 : This means that, in practice, we alternate between sampling electron positions generated using a Langevin algorithm with the probability of the trial wavefunction serving as the target distribution, and optimizing the trial wavefunction parameters using stochastic gradient descent. For further details, see ref. 27 .

Computing excited states
We now introduce the central idea of this paper: a deep VMC method to compute the ground and low-lying excited states of a given electronic system. While we employ PauliNet to represent the individual wavefunctions, the method can also employ FermiNet or other realspace wavefunction representations with suitable modifications.
In a similar spirit to the ground-state optimization process, we first obtain a reasonable baseline for each state by performing a minimal state-averaged CASSCF calculation. This optimizes the energy average for all states in question and yields a single set of orbitals to construct each multi-determinant wavefunction, which in turn are supplemented by the analytically-known cusp conditions. We fix the number of determinants in our ansatz by cutting off the CASSCF expansion based on the absolute values of their determinant coefficients. The choice of the CASSCF baseline ensures that the PauliNet ansatzes for the different excited states are close to orthogonal upon initialization. In contrast to the ground-state calculation, the optimization of excited states requires a more nuanced choice of active space. In principle, we must ensure that the solutions contain determinants with orbitals of the necessary rotational symmetries (the Jastrow factor and backflow correction are rotationally-symmetric modifications of the orbitals) and spin configurations (the choice of the number of spin-up and spin-down electrons does impose restrictions on the states that may be attained by our ansatz). For most systems studied in this paper, a generic choice of the active space was sufficient (see Supplementary Table VIII) and we have not studied the dependence on the CAS initialization in more depth. As shown in previous studies the quality of the orbitals has only a minor effect on the training and does not change the final energy 51 . If, however, the initialization is not accounted for and the baseline solutions provide a qualitatively wrong spectrum of excited states our ansatzes may be trapped in local minima and miss intermediate excited states (see Fig. 2), even though we keep the Slater-determinant coefficients c p and linear coefficients c μk of the single-electron orbitals φ μ (r i ) = ∑ k c μk ϕ k (r i ) trainable.
Our objective is to calculate the lowest n eigenstates of a given system, that is, find the set of orthogonal states that minimizes the energy expectation value. We approach this challenge by introducing a penalty term to the energy loss function (Eq. (6)) and optimizing the joint loss for n PauliNet instances: where E i = E r ∼ |ψ θ,i | 2 and S ij is the pairwise overlap between states i and j. The functional form of the overlap penalty is chosen to diverge when two states collapse and behave linearly when states are close to orthogonal (see the next section for details). This allows states to overlap during the optimization procedure while preventing their collapse and eventually driving them to orthogonality when they have settled in a local minimum of the energy. The hyperparameter α weights the two loss terms and can be increased throughout the training to strengthen the orthogonality condition when approaching the final wavefunctions. For a sufficiently large α, the true minimum of the loss function corresponds to the sum of the energies of the lowestlying excited states with these states having no overlap. Thus, optimizing the penalized loss function (Eq. (7)) leads to an unbiased convergence towards the lowest-lying excited states (see below). In practice a small α is typically sufficient, making a robust choice possible.
To stabilize the training and reduce the computational cost we detach gradients in such a way that we only consider the overlap with the lower-lying states respectively, that is, the ground state is subject to unconstrained energy minimization and the nth excited state introduces n pairwise penalty terms. We compute the overlap of the unnormalized states i and j as the geometric mean of the two Monte Carlo estimates, obtained over distributions |ψ θ,i | 2 and |ψ θ,j | 2 , respectively: The sign of the overlap can be obtained from either of the two estimators, which match in the limit of infinite sampling. If the overlap is close to zero and the signs of the two estimates differ due to statistical noise of the sampling, we consider the states to be orthogonal. Similar to the energy loss, the gradient (We employ gradient clipping to stabilize the training.) of the pairwise overlap can be formulated such that it depends on the first derivative of the log wavefunction with respect to the parameters only (see below for details).
Finally, we note that different states may be modeled at different levels of quality, which can lead to erroneous excitation energies. In order to improve the error cancellation of our ansatzes we employ a variance-matching technique. As the variance of the energy σ 2 can be considered a metric of how close a wavefunction is to a true eigenstate, variance-matching procedures can be useful tools 21,52,53 . Here, we utilize a simple scheme: for single-state quantities such as total energies, we evaluate all wavefunctions at the end of training. For multi-state quantities, such as excitation energies or transition dipole moments, we match states of a similar variance. That is, if final ψ θ,i has a lower variance than final ψ θ,j , we take ψ θ,i at an earlier point in training. This simply involves computing σ 2 of the training energies and applying an exponential moving average at each iteration to monitor convergence (see below for details). We find this procedure typically improves the final results.

Loss function and overlap penalty
There are a number of choices of possible loss functions for the optimization of excited states in quantum Monte Carlo 20,26,54 . In order to assess the feasibility of excited-state optimization with deep neural network ansatzes in variational Monte Carlo we conducted a range of experiments with different types of optimization objectives. Our empirical findings showed that employing a penalty method is the conceptually most straightforward approach and gives stable results when combining it with our implementation of PauliNet. Initially, we started with an overlap penalty term similar to Pathak et al 26 . However, we found that our optimization could still collapse even if we chose a sufficiently large prefactor (α) and the training could not recover. We therefore switched to an alternative penalty term (Eq. (7)) which diverges upon a collapse of the states. The effect of our penalty term can be illustrated by considering the loss for a two-state system with the exact ground state |ψ 0 and a linear combination of the ground and first excited state |ψ 1 (see Fig. 6): The overlap and the energy can be obtained as In the vicinity of the orthogonal solution, the Taylor expansion of the penalty term is 1 1 À |S| À 1 = |S| + |S| 2 + |S| 3 + :::, at |S| = 0, ð11Þ that is, the overlap penalty behaves linearly to first order. This gives rise to a penalty that is locally stable for any prefactor, lower bounded by the S 2 penalty term, and diverges if states collapse. For a large enough α parameter the global optimum of the total loss is at zero overlap, that is, the optimization method is incentivized to find exactly orthogonal states without mixing. In practice, for the batch sizes used in our calculations, we have not observed a bias due to the non-linear nature of the penalty when applied to sampled expectation values of the overlap. However, it is expected that this is no longer the case in the limit of small batches. In order to elucidate how our loss function behaves in this regard, we compute the two lowest states of LiH using a range of different batch sizes (see Fig. 7). We find the optimization procedure to be robust for the large batch sizes that we typically employ (≥2000), with the excitation energy within 1 mHa of the exact, and the pairwise overlap remaining small throughout training (panel c). For smaller batch sizes, we observe a larger degree of statistical noise in the pairwise overlap, which leads to a less reliable approximation for the excited state and the corresponding excitation energy (panel b).  57,58 . While the optimization works well for the large batches that we typically employ in our calculations (≥2000), this becomes less reliable, at least for the excited state, in the limit of smaller batch sizes. Note: Darker corresponds to a larger batch size in each respective color. Fig. 6 | Sketch of the loss function. This figure illustrates the behavior of our loss function for a two-state system. The ground state is kept fixed and the second state is considered to be a linear combination of the ground state and first excited state (Eq. (9)). The scales are to be understood in arbitrary units, as they depend on the choice of hyperparameters and the energies of the system under investigation.

Gradient of the loss function
In order to differentiate the loss function we explicitly formulate the gradient. We consider the general case of a mixed observable: where N i , N j are the norms of the wavefunctions and E i = E r ∼ |ψ θ,i | 2 . By the property of Hermitian matrices, O ij = O ji , we derive an expression that does not depend on the wavefunction norms: = sgn E iÔ ψ θ,j ðrÞ ψ θ,i ðrÞ This expression reduces to the pairwise overlaps (Eq. (8)) upon settingÔ = Id. The derivative of this term can be expressed as where (i ⇔ j) is an additional term with the two indices interchanged. By considering the Hamiltonian operatorĤ and setting i = j we recover the gradient of the energy loss 27 : Variance matching As far as relative energies are concerned most computational chemistry methods rely heavily on the cancellation of error. While quantum Monte Carlo methods using neural network-based trial wavefunctions provide highly accurate total energies, the flexibility of these ansatzes is difficult to control which can lead to varying qualities of approximations for different states. In order to account for potential imbalances we utilize the variance of the wavefunctions as a measure of the quality of the approximation (zero-variance principle) and employ a variance-matching scheme. Variance-matching techniques as well as variance extrapolation have typically been applied by optimizing a family of ansatzes and comparing variances across the optimized wavefunctions 53 . Instead of training multiple ansatzes we checkpoint wavefunctions during the training and compute excitation energies by rewinding the ground state to match the variance of the excited state as depicted in Fig. 8. The mean and variance of each wavefunction are computed over the batch dimension at each step in training and smoothed with an exponential walking average. For the final estimation of excitation energies, the respective wavefunctions are then sampled exhaustively as in the usual evaluation process. While the variance matching hardly impacts the excitation energies for small systems, for larger and harder-to-optimize systems, such as benzene, it becomes increasingly relevant.

Spin treatment
PauliNet encodes only the spatial part of the wavefunction and its likespin antisymmetry explicitly 12 , while the spin part, which guarantees the opposite-spin antisymmetry, is only implicit. Every spin-assigned spatial ansatz such as PauliNet is always an eigenstate of S z with an eigenvalue of M = 1 2 ðN " À N # Þ, but it may not be an eigenstate of S 2 . The spatial part of eigenstates of S 2 is characterized by specific sets of permutational symmetries involving opposite-spin electrons 55 . Pauli-Net does not enforce these symmetries but instead attempts to learn them through the variational principle because eigenstates of the Hamiltonian are also eigenstates of S 2 . Therefore, we do not, in general, control the spin of the eigenstates found in the optimization procedure-they are simply found in the order of increasing energy, independent of spin. The spin of a found eigenstate can be obtained in principle by Monte Carlo sampling 56 . Whether a particular spin state is found in practice may be influenced by the spin of the CASSCF baseline wavefunction, which we, therefore, report in Supplementary Table VIII. In special cases, we may wish to target a specific spin state (e.g., see the section "Multi-reference application: conical intersections"), and for that, we can take advantage of the orbital-assigned backflow of Pauli-Net. Combined with the freezing of the determinant coefficients, this ensures that PauliNet remains in the same spin state as the CASSCF baseline wavefunction.

Data availability
The dataset generated in this study is openly available in Zenodo (https://doi.org/10.5281/zenodo.7274855). Source data are provided with this paper. Fig. 8 | Sketch of the variance-matching procedure. The excitation energy of the benzene calculation at step 4000 is obtained for illustration purposes. The variance (b) of the excited state is higher than that of the ground state and is therefore matched with the variance of the ground state at a previous iteration. The excitation energy is computed by comparing the mean energies (a) at the respective iterations. This acts to reduce the excitation energy and is found to improve the results in all of our experiments.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/ licenses/by/4.0/.